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CONNECTION  MACHINE  SOFTWARE  CONVERSION 
OF  THE  NAVY  TOPS  MODEL 


1  INTRODUCTION 


Models  which  solve  a  set  of  partial  differential  equations  form  a  large  and  important  category 
of  scientific  applications.  These  applications  axe  commonly  structured  to  run  well  on  vectorizing 
machines  such  as  the  Cray  Y-MP  and  the  Convex  C  series. 

The  introduction  of  highly  parallel  machines  [1,  2]  with  peak  performance  significantly  exceed¬ 
ing  the  Cray  machines  has  sparked  interest  in  running  scientific  models  on  these  new  machine 
architectures.  The  demonstration  over  the  past  few  years  of  many  models  restructured  success¬ 
fully  for  these  machines  has  led  to  growing  interest  in  code  conversion.  This  is  in  part  due  to 
the  widespread  belief  that  economic  factors,  principally  the  levera^ng  of  commodity  micropro¬ 
cessor  and  memory  technology,  will  make  highly  parallel  machines  more  cost-elective  than  vector 
architectures. 

Many  easting  codes  reflect  decades  of  optinuzaUon  for  sequential  processing.  Achieving  max¬ 
imum  parallelism  from  such  a  starting  pcunt  can  be  difiKcult.  An  attractive  alternative  to  code 
conversion  is  development  of  a  new  model  from  the  basic  mathematical  formulation.  This  is  ap¬ 
pealing  as  a  way  to  build  a  model  taking  maximum  advantage  of  the  machine  potential,  both  from 
a  coding  and  an  algorithmic  pmnt  of  view.  It  also  affords  an  opportunity  to  revise  the  formulation 
to  incorporate  newer  ideas  or  to  rectify  deficiendes. 

Code  conversion  was  chosen  rather  than  re-development  for  two  reasons.  Fintly,  the  cost  of 
convernon  is  lower  than  the  cost  of  re-devdopment.  Previous  experiences  with  similar  models  has 
clearly  shown  that  the  conversion  cost  is  not  typically  high.  The  mapping  of  a  finite  difference  grid 
point  modd  to  an  MPP  is  not  espedaUy  challenging.  Secondly,  an  existing  modd  often  has  a  long 
history  of  operational  use  or  has  been  comptehensivdy  validated  at  some  point.  The  operational 
history  or  validation  gives  operational  confidence  in  the  existing  modd  predictions  which  a  new 
modd  would  lack. 

This  document  describes  the  conversion  of  the  Navy  modd  called  TOPS,  the  Thermodynamic 
Oceanographic  Prediction  System.  The  conversion  started  with  a  serial  version  in  Fortran  77  and 
ended  with  a  Fortran  90  [4]  version  for  the  Connection  Madiine  CM-S.  Data  mapping,  conversion 
jjMwitig,  and  performance  pmnts  of  view  are  considered. 


iisBMScfipt  approved  /iifjl  21,  1994. 
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2  BACKGROUND 


This  conversion  effort  is  part  of  a  series  of  conversions  for  ocean  and  atmospheric  models[6, 
7].  The  approach  taken  for  TOPS  is  similar  to  that  taken  for  OCEANS  and  differs  from  the 
approach  used  in  [6].  The  TOPS  model  structure  required  substantially  greater  effort  to  convert 
than  OCEANS  due  to  software  structural  features  which  needed  to  be  changed. 

The  starting  Fortrsm  77  version  of  TOPS  is  a  generic  code  with  a  generic  test  case.  There 
are  many  provisions  in  the  code  for  alternative  input  or  output  files  and  hooks  for  the  addition  of 
additional  history  variables  beyond  TOPS’  basic  set  of  two  current  components,  temperature,  and 
salinity. 


3  DATA  LAYOUT 

Finite  difference  models  of  ocean  or  atmospheric  variables  tend  to  have  simple  mappings  to 
the  CM-5.  Often  a  set  of  simultaneous  equations  must  be  solved  and  this  can  affect  the  choice  of 
memory  layout. 

In  the  case  of  TOPS,  tridiagonal  systems  are  solved  in  routine  TRID,  called  within  the  routine 
UPDiT2.  There  are  many  tridiagonal  systems,  one  for  each  vertical  column  and  for  each  variable. 
AH  of  the  systems  for  a  ^ven  variable  may  be  solved  in  parallel. 


3.1  Layout  Principles 

The  layout  of  data  in  an  MPP  is  often  the  single  most  important  factor  in  achieving  good  per¬ 
formance.  Layout  is  a  global  decision  in  the  sense  that  the  best  layout  of  data  to  optimize  the 
performance  of  a  routine  or  group  of  routines  may  not  be  the  best  to  optimize  any  one  operation 
in  the  code.  For  a  compiler  to  make  a  reasonable  choice,  it  would  have  to  consider  more  than  a 
single  statement  or  loop  nest  and  it  would  need  knowledge  of  the  iterative  structures  within  and 
among  routines.  It  is  primarily  for  these  reasons  that  MPP  vendors  have  chosen  to  avoid  the  data 
la3ront  problem  and  forced  the  programmer  to  choose. 

Like  atmospheric  modeb,  ocean  models  tend  to  be  more  computationally  complex  along  vertical 
columns  than  in  horizontal  directions.  It  is  therefore  common  that  vertical  columns  be  allocated 
completely  within  a  single  processor.  This  course  of  action  was  taken  in  the  model  conversion. 

The  basic  shape  of  data  in  TOPS  is  three  dimensional,  two  horizontal  dimensions,  x  and  y,  and 
a  vertical  dimension  z.  The  indices  of  z  increase  from  the  surface  downwards.  There  are  two  sets 
of  values  corresponding  to  two  time  steps  of  the  four  basic  variables. 

The  serial  TOPS  code  approached  data  layout  from  the  point  of  view  of  minimizing  memory 
usage.  A  “slab”  processing  approach  was  used.  Figure  1  shows  how  the  method  was  implemented 
fat  TOPS.  The  main  time  step  loop  contains  an  inner  slab  loop  which  sequences  through  lower- 
dimensional  sections  of  the  history  variables.  In  the  case  of  TOPS,  sections  corresponding  to  x-z 
planes  are  taken.  The  slab  loop  must  preserve  the  context  needed  to  process  the  individual  slabs 
in  the  y  direction.  This  introduces  an  asymmetry  in  processing  along  the  y  direction  which  is  not 
present  in  the  mathematical  formulation. 
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Fig.  1  —  Slab  Pioccttiiig 


The  parallel  version  of  TOPS  does  not  use  the  slab  approach  because  its  low  level  of  parallelism 
is  insufficient  for  (and  inefficient  on)  the  Connection  Machine.  This  fundamental  change  in  the 
software  caused  most  arrays  shapes  in  lower  level  routines  to  change. 

Current  ThinMng  Maclunes  compilers  require  that  all  serial  axes  precede  the  parallel  ones. 
This  restriction  is  contrary  to  the  normal,  and  most  consistent,  usage  patterns.  To  satisfy  these 
restrictions,  the  subscripts  of  the  main  arrays  in  the  model  were  reordered  to  place  the  last  three 
axes,  levri,  time  step,  and  variable  first.  In  the  present  compiler,  this  does  not  result  in  efficient 
passing  of  three-dimensional  sections  to  the  low  level  computation  routines. 

Since  the  completion  of  the  ori^al  TOPS  port.  Thinking  Machines  Corporation  compilers 
have  been  changed  to  store  arrays  in  a  more  standard  form.  Some  of  the  current  problems  with 
array  batniHfig  can  be  removed  at  the  cost  of  again  revising  the  code  in  a  systematic  way. 


3.2  Major  Variables 

Figures  2  through  5  show  the  shapes  of  major  variables  ori^ally  and  after  conversion.  Note  that 
these  fffnfw  that  serial  axes  last  are  acceptable.  Since  the  conversion  took  place  before 

the  oomriler  restriction  was  lifted,  aU  serial  axes  of  parallel  arrays  have  been  moved  to  the  left 
side,  in  the  same  order.  Axes  dimensioned  (n.a)  represent  the  x  and  y  axes.  All  arrays  with  the 
OQmlnnatkm  of  axes  (n,>)  are  lEVS  parallel.  The  1  axis  always  represents  the  vertical  direction  and 
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such  axes  are  always  allocated  serially,  either  as  completely  serial  arrays  or  serial  within  processor 
if  the  array  is  a  parallel  one  (i.e.,  has  the  n  and  m  axes).  The  axis  nt  ranges  over  time  steps  (2),  and 
the  nr  axis  ranges  over  variables  (the  4  basic  u,  v,  t,  and  s  plus  the  potential  for  others).  History 
variables  are  stored  in  a  single  array,  r(n.m.l.nt.nr). 


Original  Variable 

New  Variable 

Notes 

dxbCm) 

dxb(n,m) 

dxm(m) 

dxmCn.m) 

Spread 

dxbrCn) 

dxbr(n,n) 

Calculated  from  dxb 

dxmrCm) 

dxnrCn.a) 

Calculated  from  dxm 

da(m) 

daCn.n) 

Spread 

dxbdalCm) 

dxbdaKn.m) 

Calculated 

dxbda2(n) 

dxbda2(n.m) 

Calculated 

dydarCm) 

dydar(n,m) 

Spread 

2b(l) 

zbCl) 

(Serial) 

znCl) 

za(l) 

(Serial) 

dzb(l) 

dzb(l) 

(Serial) 

dzm(l) 

dsmCl) 

(Serial) 

dzbrCl) 

dzbr(l) 

(Serial) 

dznr(l) 

dznr(l) 

(Serial) 

olong(n) 

along  (n.n) 

Spread 

alatfa) 

alat(n,B) 

Spread 

Pi*.  3  —  TOPS  Amjra — Grid  Puameten 


Bask  (n, a) 

ZBaekCn.B) 

yaaskCn.a) 

mask  (n, a) 

xaaskCn.a) 

yaaakCn.a) 

Fi*.  S  —  TOPS  Axnys — Mask  Paxsmeten 


The  notes  in  figores  2  throogh  5  indicate  some  arrays  were  spread.  This  means  that  a  new  axis 
has  been  introduced  and  the  values  along  the  new  axis  are  copies  of  the  original  array  values.  A 
similar  situation  exists  for  arrays  with  the  notation  calculated  but  in  these  cases,  the  redundant 
values  were  computed  rather  than  copied.  In  all  of  these  cases,  the  redundant  values  were  intro¬ 
duced  to  eliminate  the  need  to  communicate  the  non-redundant  set.  These  variables  all  relate  to 
grid  parameters  and  do  not  have  a  vertical  1  axis.  The  space  taken  up  by  the  redundant  values  is 
not  significant.  The  other  notation  used  in  the  figures  is  collected.  This  means  that  the  m  axis  cor¬ 
responding  to  the  slabs  has  been  added  and  the  ^ues  along  this  axis  correspond  to  the  succession 
of  values  taken  by  the  original  array  inside  the  slab  loop. 

The  oripnal  code  used  a  set  of  temporaries,  again  to  conserve  space.  These  temporaries  were 
passed  to  lower  levd  routines  as  work  arrays.  This  practice  is  not  effective  on  the  Connection  Ma¬ 
chine  because  of  the  implicit  equivalendng  that  is  sometimes  involved.  The  parallel  code  dispenses 
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Original  Variable 

New  Variable 

f  (a) 

istypeCntyp) 
iptypa(n.B) 
qrf (l.ntyp) 
xk(2) 
yk(2) 

2k  (2) 

xkdzr(n.B,2) 

yk<l7r(n,B.2) 

zkdzr(l,2) 

f  (n.m) 

iatype(ntyp) 

iptype(n,m) 

qrf (l.ntyp) 

xk(2) 

yk(2) 

zk(2) 

zkdzr(n,B.2) 

ykdyr(n.B.2) 

zkdzr(l,2) 

rCn.a.l.nt.sr) 

r(n.a.l.nt.nx) 

tlaCn.a) 

kt(n,m) 

tlB(n,B) 

kt(n.D) 

Notes 


Fig.  4  —  TOPS  Anayt — Problem  Puameteit  (Put  1) 


Original  Variable 

New  Variable 

Notes 

rgbCn.B.nr) 

rgsCn.nr) 

qrCn) 

rbgt(n.B,3,jir>2) 

ugCn.l) 

▼g(n.2,l) 

u(n.l) 

v(n,2.1} 

wCn.l) 

rgb(n.B,nr) 

rgsCn.B.nr) 

qr(xi.B} 

rbgt(n.B,3.nr-2) 

ug(n.B,l) 

▼g(ll,B.l) 

tt(n,B.l} 

v(n.B,2.1) 

«(n.B.l} 

Collected 

Collected 

Collected 

pl(n.2.1} 

p2(n.2.1} 

plCn.B.l) 

p2(n,a,l) 

Collected 

Collected 

fly(n.2.1,iir) 

suBCn.l.nr) 

r2(n.l,iir) 

fly(n,B,l,nr) 

suBCn.B.l.nr) 

r2(n.B,l.nr) 

Collected 

Collected 

Collected 

Pig.  5  —  TOPS  Aiieye — ^Problem  Puuaeten  (Put  3) 
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with  many  of  these  temporaries  in  the  new  calling  sequences  but  many  instances  not  causing  equiv¬ 
alence  problems  still  remain.  Needed  temporaries  are  often  dynamically  allocated  within  the  lower 
level  routines,  resulting  in  cleaner  interfaces  and  higher  quality  code.  The  practice  of  allocating 
temporaries  in  the  lower  level  routines  is  not  possible  in  Fortran  77  since  dimensional  information  is 
passed  down  to  the  lower  level  routines.  Passing  dimensions  is  often  considered  good  programming 
practice  but  is  the  cause  of  some  awkward  programming  idioms.  A  more  desirable  alternative  is  an 
include  file  containing  all  of  the  problem  dimensions  but  this  necessitates  complete  recompilation 
for  each  set  of  dimensions. 

4  SOFTWARE  STRUCTURE 

The  annotated  indented  call  structure  is  shown  in  Figure  1.  Note  that  a  routine  is  listed  more 
than  once  if  calls  appear  in  the  program  text  in  more  than  one  place  and  that  its  entire  subtree  is 
reproduced  at  each  call. 


TWUe  1  —  Call  Tree  of  Oti^al  Code 


Call  Tree 

Function 

main 

Entry  point  routine 

•  dimens 

Define  dynamic  dimensions  for  run 

•  •  getint 

Read  integer  from  standard  input 

*  •  •  strlen 

Find  trimmed  length  of  string 

•  top82 

IVue  main  routine 

*  •  define 

Define  data  set,  model,  and  run  parameters 

•  •  •  vgrid 

Calculate  vertical  grid 

.  •  •  8phgd2 

Calculate  parameters  for  horisontal  lat-lon  spherical  grid 

•  •  •  pxntpar 

Print  out  data  set  parameters 

•  •  Ismask 

Define  the  land-sea  mask  (boundary) 

•  •  tin 

Initialize  temperature  field 

'  •  ’  8trcc3 

Concatenate  three  strings  together 

•  •  ’  •  strlen 

Find  trimmed  length  of  string 

•  •  •  bgsrd 

Read  3D  data  file 

•  •  •  tinitl 

Define  initial  3D  temperature  field 

•  •  •  getenv 

Get  environment  variable  value 

.  .  .  rdhdf 

Dummy  routine  for  reading  hdf  files 

•  •  sin 

Initialize  salinity  field 

•  •  •  strcc3 

Concatenate  three  strings  together 

•  •  •  •  strlen 

Find  trimmed  length  of  string 

•  •  •  bgsrd 

Read  3D  data  file 

.  .  .  sinitl 

Dummy  routine  for  initial  salinity  field 
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CaUTree 

Function 

•  •  •  tsrel 

Define  initial  3D  salinity  field  via  T-S  relation 

getenv 

Get  environment  variable  value 

•  ■  ■  rdhdf 

Dummy  routine  for  reading  hdf  files 

saladD 

Adjust  salinity  profile 

•  •  •  •  saladj 

Adjust  salinity  profile 

.  intarp 

Perform  linear  interpolation 

.  pdanav 

Calculate  density  anomaly  for  seawater 

.  coefex 

Calculate  coefficients  of  expansion  of  seawater 

■  •  data 

Write  initial  temperature  and  salinity  fields  to  DA  file 

•  •  •  darite 

Write  3D  field  or  section  to  DA  file 

•  •  rin 

Initialize  auxiliary  fields 

•  •  setup 

Perform  setup  calculations  for  TOPS 

•  •  •  extpro 

Calculate  solar  extinction  profile  for  mixed  layer 

•  •  •  •  extjerl 

Interpolate  an  extinction  profile  for  solar  radiation 

.  intarp 

Perform  linear  interpolation 

•  •  •  •  axtnual 

Calculate  solar  flux  extinction  (Mueller  and  Lange) 

.  extir 

Compute  extinction  with  depth  of  III  spectrum 

.  .  .  .  extaorl 

Compute  solar  flux  attenuance  by  pigment  concentration 

.  axtir 

Compute  extinction  with  depth  of  IR  spectrum 

. intxpb 

Perform  linear  interpolation 

•  •  prntgrd 

Print  parameters  associated  with  horizontal  or  vertical  grids 

•  •  •  pzntd 

Print  all  or  part  of  2D  integer/real  array 

•  •  ain 

Define  initial  Ekman  velocities  and  mixed-layer  depth 

.  .  .  surfbc 

Calculate  surface  boundary  conditions 

.  .  .  .  sbctat 

Calculate  forcing  for  test  case  12 

.  .  .  .  abcsbp 

Dummy  routine  for  atmospheric  forcing 

•  •  •  •  sbcnog 

Dummy  routine  for  atmospheric  forcing 

•  •  •  •  sbcacm 

Dummy  routine  for  atmospheric  forcing 

•  •  •  '  darita 

Write  3D  fidd  or  section  to  DA  file 

•  •  <  iakman 

Initialize  Ekman  velocity  field 

•  •  •  •  buoy 

Calculate  buoyancy  from  termperature  and  salinity 

.  .  output 

Output  model  fields 

.  .  .  tiaapr 

Print  time  in  hours,  days,  etc 

•  •  •  print 

Print  out  model  fields  interactively 

•  •  •  •  pzntsl 

Print  section  of  3D  field 

.  sacdaf 

Interactively  get  3D  section 

. pzntsj 

Print  section  of  3D  field 

Print  section  of  3D  array  in  integer  format 

Call  Tree  I  Function 


•  •  •  •  secdef 

Interactively  get  3D  section 

•  •  •  •  daread 

Read  sections  of  fields  from  DA  file 

•  •  •  •  pmtsj 

Print  section  of  3D  field 

.  prnts 

Print  section  of  3D  array  in  integer  format 

•  •  •  getreal 

Read  real  number  from  standard  input 

.  strlen 

Find  trimmed  length  of  string 

•  •  ■  •  mldSam 

Estimate  MLD  field  from  3D  termperature  field 

•  •  •  •  pmtc 

Print  all  or  part  of  2D  integer  array 

•  •  ■  •  profil 

Print  vertical  profiles  at  a  single  point 

.  pdenav 

Calculate  density  anomaly  for  seawater 

•  •  •  •  getint 

Read  integer  from  standard  input 

.  strlen 

Find  trimmed  length  of  string 

•  •  •  •  snitch 

Swap  two  integers 

•  •  •  prbgt 

Print  regionally  averaged  heat  and  salt  budgets 

•  •  •  •  avellB 

Find  area  average  for  a  data  field 

.  sphgdn 

Calculate  weights  for  statistics  on  lat-lon  grid 

•  •  •  •  daread 

Read  sections  of  fields  from  DA  file 

•  •  surfbc 

Calculate  surface  boundary  conditions 

•  •  •  sbctst 

Calculate  forcing  for  test  case  12 

•  ■  •  sbcshp 

Dummy  routine  for  atmospheric  forcing 

•  •  •  sbcnog 

Dummy  routine  for  atmospheric  forcing 

•  •  •  sbceca 

Dummy  routine  for  atmospheric  forcing 

•  •  *  darite 

Write  3D  field  or  section  to  DA  file 

•  ’  geonel 

Calculate  geostrophic  currents 

•  •  •  geovelb 

Calculate  pressure 

•  •  cirvel 

Calculate  circulation  currents 

•  •  •  getcohis 

Read  history  files  from  OCEANS 

•  •  •  ertendhad 

Extend/extrapolate  2D  field  to  fill  in  unknown  values 

•  •  •  cirvelc 

Calculate  pressure 

•  •  adwel 

Calculate  total  advection  velocity 

•  •  davel 

Write  geostrophic  circulation  and  total  velocities  to  file 

•  •  •  darite 

Write  3D  field  or  section  to  DA  file 

•  •  cfl 

Calculate  CFL  parameters  for  advection  and  diffusion 

•  •  ttpdat2 

Update  fields 

•  •  •  adndlf 

Calculate  rate  of  change  of  field  due  to  several  causes 

.  .  .  budget 

Calculate  cumulative  budget  for  temperature  and  salinity 

.  .  .  aod2 

Calculate  vertical  eddy  coefficients 

•  •  •  •  buoy 

Calculate  buoyancy  from  termperature  and  salinity 
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Call  Tree 

Function 

hann3 

•  •  •  trid 

•  •  switch 

■  •  conad 

■  •  •  buoy 

Apply  Hanning  3-point  smoother  to  an  array 

Solve  tridiagonal  system  of  linear  equations 

Swap  two  Integers 

Perform  convective  adjustment  of  density  field 

Calculate  buoyancy  from  termperature  and  salinity 

5  CONVERSION 

This  section  discusses  the  process  used  to  convert  the  Fortran  77  serial  version  of  TOPS  to  a 
Fortran  90  (or  CM  Fortran)  parallel  TOPS.  Section  5.1  discusses  the  basics  of  the  code  conversion. 
In  Section  5.2,  a  practice  called  partUlel  extension  is  described  and  an  example  is  shown.  The  most 
significant  changes  to  TOPS  related  to  the  tridi^onal  solver.  Section  5.3  discusses  the  underlying 
theory  used  and  shows  the  changes  in  tridiagonal  setup  in  order  to  solve  all  tridiagonal  systems  in 
parallel.  The  section  closes  with  a  discussion  of  verification  methodology. 

5.1  Basic  Conversion 

In  the  conversion  of  the  OCEANS  model  (see  [7]),  the  most  difficult  conversion  problems  came 
during  integration  of  the  routines.  Since  the  basic  functionality  of  the  routines  had  already  been 
established  to  a  high  confidence  level,  the  problems  were  known  to  lie  in  the  main  routine  or  in  the 
interface  between  the  main  and  subordinate  levels. 

Most  problems  were  traced  to  layout  differences  between  arrays  declared  in  the  main  routine 
and  passed  to  a  subroutine  and  the  subroutine  formal  parameter  declarations.  There  were  no 
available  mechanisms  at  the  time  to  detect  these  errors  except  direct  visual  checking.  A  degree  of 
run-time  checking  is  present  in  the  current  compiler  version  but  the  reliability  and  thoroughness  is 
not  clear. 

One  metnod,  use  of  interface  blocks,  has  since  emerged  as  a  way  to  detect  these  mismatches. 
In  the  TOPS  conversion  interface  blocks  were  used  to  check  interfaces  at  compile  time.  In  order 
to  make  interface  blocks  easy  to  maintain,  a  un^e  include  file  was  constructed  which  defined 
the  interfaces  for  all  routines.  This  proved  to  be  very  convenient  and  helpful  in  that  interface 
problems  never  became  a  major  source  of  run-time  bugs.  I  few  minor  anomalies  were  noted 
in  the  use  of  interface  blocks,  particularly  in  practices  which  involved  implicitly  passing  array 
sections.  Although  these  practices  are  common  in  sequential  Fortran  77  usage,  they  are  somewhat 
questionable  in  Fortran  90.  The  use  of  an  include  file  minimizes  software  maintenance  effort  and 
the  entire  interface  block  file  can  be  replaced  by  commented  lines  if  the  interface  block  concept  is 
not  supported  or  not  desired. 

As  in  the  OCEANS  etmversion,  common  blocks  were  moved  to  include  files.  TOPS,  however, 
does  not  rely  on  common  blocks  in  a  ugnificant  way  and  uses  the  technique  of  passing  arguments 
thtongh  the  formal  parameter  list.  As  mentioned  in  Section  3,  the  TOPS  practice  of  passing 
dimensions  was  occarionaUy  awkward. 

In  general  the  conversion  of  TOPS  retained  the  same  module  structure  as  in  the  Fortran  77. 
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The  primary  deviations  came  from  the  process  of  parallel  extension  (see  Section  5.2)  were  where 
it  was  sometimes  necessary  to  retain  a  serial  version  as  well  as  the  new  parallel  version . 

Much  of  the  code  operates  on  the  interior  of  the  horizontal  extent  with  horizontal  subscripts 
nuining  from  2  through  n-1  and  2  through  m-1.  This  creates  visually  cluttered  code  because  section 
designators  must  be  spedhed  for  each  array  usage.  There  is  also  a  small  run-time  cost  to  calculate 
the  masks.  A  better  approach  is  to  use  a  mask  which  is  .TRUE,  in  the  interior  and  .FALSE,  on 
the  boundary.  When  used  in  the  WHERE  statement,  cleaner  code  is  produced  with  the  potential  for 
executing  slightly  faster  on  the  CM-5.  An  easy  method  of  handling  these  masks  is  to  place  them 
in  common  blocks  and  have  the  routines  use  them  as  needed.  The  problem  with  this  in  TOPS  is 
the  previously  mentioned  practice  of  passing  dimensional  information  through  the  parameter  list. 
The  masks  in  common  must  have  static  dimensions  but  the  compiler  cannot  know  or  assume  that 
the  passed  dimensions  of  array  arguments  are  the  same  as  the  common  mask  arrays.  An  alternate 
is  to  explicitly  pass  the  mask  arrays  to  the  lower  level  routines  but  this  requires  changes  to  almost 
all  parameter  lists  and  gives  a  somewhat  cluttered  style.  It  should  be  noted  that  in  the  current 
code  masks  are  frequently  generated  as  local  temporaries.  This  can  be  considered  a  compromise 
solution  although  many  routines  still  contain  explicit  sectioning. 


5.2  Parallel  Extension 

Many  routines  in  the  Fortran  77  TOPS  operated  on  a  single  point  or  a  single  column,  an  approach 
lacking  parallelism.  In  most  cases,  and  in  all  cases  occurring  within  the  time  step  loop,  the  routines 
were  extended  to  operate  over  all  points  in  a  vertical  plane  or  on  all  columns.  The  approach  chosen 
in  the  serial  code  often  reflects  the  fact  that  the  operations  were  applied  only  to  a  subset  of  points 
or  columns.  In  the  parallel  versions,  a  mask  is  supplied  which  selects  which  elements  are  to  be 
operated  on  by  the  routine. 

Figure  6  shows  a  typical  routine  from  the  Fortran  77  version.  In  this  case  the  routine  operated  on 
a  sinj^e  point  at  a  time,  applying  a  pair  of  formulas  to  the  scalar  values  t  and  a,  water  temperature 
and  salinity. 

The  parallel  routine  is  shown  in  figure  7.  Dimensional  information  has  been  added  (parameters 
ni  and  aj)  as  well  as  a  controlling  mask  array.  The  t  and  s  parameters  are  now  entire  horizontal 
extents.  The  expressions  in  lines  24  and  25  of  the  figure  are  now  array  operations  and  the  result 
parameters  alphag  and  betag  are  arrays. 

This  example  also  illustrates  the  practice  of  setting  inactive  elements  (according  to  the  mask)  to 
zero  so  that  they  at  least  have  valid  values,  occasionally  a  helpful  situation  during  debugging.  The 
Connection  Machine  layout  directives  are  shown  in  lines  8-10.  The  interface  blocks  are  included 
in  line  12. 


5.3  IVidiagonal  Solver  TRID 

This  routine  oripnally  solved  a  single  tridiagonal  system  of  order  k  where  the  first  k  levels  of  the 
history  variables  represent  the  mixed  layer.  Since  the  depth  of  the  mixed  layer  potentially  differs 
in  eadi  column,  a  method  was  needed  to  solve  a  large  number  of  tridiagonal  systems  of  varying 
order  in  parallel. 
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Fig.  7  —  F.»m»npU  of  PsnUel  Exteunon — ^Aftei 
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The  method  chosen  was  to  imbed  the  individual  hth-order  systems 


Ax  =  b 

within  larger  systems,  all  of  a  constant  fixed  order  in  such  a  way  that  the  result  is  essentially  the 
same.  This  allows  solving  mtdtiple  instances  in  parallel  using  a  simple  extension  of  the  original 
single  system  solver.  The  system  being  solved  is,  of  course, 

(o  /)(j)  =  (j)  <') 

where  y  is  a  vector  containing  the  values  of  the  history  variable  (temperature  or  salinity  's  ample) 

below  the  mixed  layer. 

There  are  two  problems  with  this  approach.  First,  the  aimount  of  work  is  greater  since  a 
larger  system  is  solved  in  each  column  than  before.  In  past  experiences,  it  was  advantageous  to 
trade  additional  floating  point  operations  to  reduce  communications  or  to  maintain  a  high  level 
of  parallelism.  New  capabilities  in  the  Connection  Machine  CM-5  system  software  now  permit  an 
alternative  approach.  See  section  7  for  more  details. 

The  second  problem  relates  to  the  condition  numbers  of  the  modified  tridiagonal  systems  which 
potentially  differ  from  the  condition  numbers  of  the  ori^al  systems.  The  condition  number  of  a 
linear  system  of  equations  .Az  =  b  is  related  to  the  singular  value  decomposition  of  the  matrix  A. 
The  condition  number  reflects  how  sensitive  the  solution  of  Az  =:  b  is  to  errors  in  the  elements 
of  A.  A  system  with  a  high  condition  number  may  produce  inaccurate  results.  Some  algorithms 
misbehave  for  ill-conditioned  systems. 

The  ringnlar  value  decomposition  is  a  factorization  of  the  matrix  into  a  product  of  three  matrices 

A  =  VUV'^  (2) 

where  U  and  V  are  orthogonal  and  S  is  diagonal.  T  Alternatively,  using  the  orthogonality  of  U 
and  V, 

U^AV  =  E.  (3) 


The  matrix  S  has  n  positive  ringnlar  values  vi  to  of  non-increasing  magnitude  where  the 
rank  of  the  matrix  is  n.  If  the  order  of  the  matrix  is  greater  than  n,  E  contains  trailing  zeros  to 
give  E  the  same  order  as  A.  The  factorization  always  exists  for  any  rectangular  matrix. 


The  condition  number  of  a  matrix  A  is  defined  in  terms  of  the  singular  values  of  A  to  be 


■  Ml)- 


(<) 


That  is,  the  condition  number  is  the  ratio  of  the  largest  and  smallest  singular  values. 


The  condition  number  of  an  embedded  system 
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>  1 
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>  1 

<  1 
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>  1 

>  1 
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Fig.  8  —  Relktioaahip  of  <(a(.A)  to  <ca(X) 


is  not  the  same  as  the  condition  number  of  A.  To  reveal  this  relationship  observe  that 

/I  0  0\  //  0  0\  fl  00]  //  0  0\ 

Otr^O  OAO  OVO  =  0  ir^AV  0 

VO  0  l)  \o  0  I)  \o  01)  \0  0  l) 

(10  0 

=  0  E  0 

V  0  0  / 


(5) 

(6) 


The  above  demonstrates  that  the  set  of  singular  values  of  A  is  the  same  as  the  set  of  singular  values 
of  A  with  the  addition  of  the  singular  value  1.  Thus,  the  condition  number  of  >4  is 


«a(i) 


‘^n(A) 

max(l,gi(i4)) 

min(l,a„(i4))‘ 


(7) 

(8) 


There  are  several  possibilities  for  K3(A),  summarized  in  the  following  table.  The  singular  values 
of  typical  A  matrices  used  in  the  TOPS  tridiagonal  routine  have  not  been  determined  so  that  it  is 
unkuawn  wluch  case  in  figure  8  applies. 

In  terms  of  TOPS  arrays,  the  augmented  tridiag<mal  system  is 

( )  ( ^ ) '  ( 0 
where  kt  is  the  index  of  the  first  vertical  level  past  the  mixed  layer. 

The  setup  code  from  the  Ibrtran  77  serial  TOPS  version  is  shown  in  figure  9.  The  variable  kt 
is  the  index  of  the  first  pmnt  bdow  the  mixed  layer  and  the  variable  ktal  is  just  one  less  than  kt. 
The  snbsciipt  k  indexes  the  vertical  dimension  of  a  column  and  an,  bs^  and  oa  are  vectors  for  the 
lower,  main,  and  upper  diagonals  of  a  nn^  tridiagonal  system. 

.Figure  10  shows  the  paraUd  code.  The  comidexity  of  the  new  code  appears  high  at  first 
hut  is  really  £uriy  sim^  once  the  programming  tediniqne,  used  often  in  the  parallel  version,  is 
understood.  The  lo<q;>  In  fines  1  through  7  simply  sets  the  three  diagonals  of  all  active  columns  to 
the  identity  The  mask  tq^dnaak  selects  the  columni  of  the  interior  of  the  region.  The  initial  values 
of  the  main  and  upper  diagonals  are  set  In  fines  8  through  11.  Note  that  only  cdumns  with  at  least 
two  elements  In  the  mixed  layer  are  affected.  The  loop  from  line  12  to  line  18  is  perhaps  the  most 
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Fif.  9  —  TtidiaigoMJ  Setnp — Befoie 


interetting.  The  Tertical  index  k  sweeps  throuhont  its  total  possible  range  (all  possible  interiors  of 
the  tridiagonal  systems)  and  the  WHERE  statement  in  line  13  insures  that  only  columns  that  should 
be  updated  are  modified.  As  k  increases  toward  its  maximum  value,  fewer  and  fewer  columns  will 
be  sdected.  Lines  19  through  24  have  a  similar  structure  to  the  preceding  loop  and  here  simply  set 
the  final  values  in  the  lower  and  main  diagonals.  Note  that  the  WHERE  statement  sweeps  through 
an  possible  vertical  indices  and  selects  no  more  than  one  index  in  any  column.  Finally,  all  of  the 
tridiagonal  systenu  are  solved  together  in  line  26. 


5.4  Verification 

The  software  structure  of  the  original  model  was  used  as  the  basis  for  the  verification  of  the 
conversion.  The  module  groups  were  converted  and  tested  in  groups.  The  groups  were  based  on 
order  of  executicm  including  two  initialisation  groups  and  the  time  step  group.  Multiple  tests  were 
perfetmed  to  the  extent  posriUe  for  ahemative  paths.  At  a  minimum,  Mdi  converted  routine  and 
all  major  brandies  within  were  executed  at  least  once. 

Li  contrast  to  the  approach  used  in  OCEANS  the  test  data  and  verification  were  carried  out 
using  the  serial  code  as  a  generator.  The  serial  code  was  changed  by  the  addition  of  code  to  write 
out  data  generated  by  the  serial  code.  This  was  used  both  as  input  and  as  output  to  be  verified. 
This  method  was  chosen  over  the  random  inputs  used  for  OCEANS  because  it  was  believed  that 
the  inputs  would  be  better  test  cases  for  the  converted  code. 

The  arrays  of  data  generated  were  verified  riement  by  element.  This  again  differs  from  the 
OCEANS  canverrion  process  where  a  statistical  verification  approach  was  used.  It  has  been  found 
that  the  statistical  approach  often  misses  errors  and  is  not  helpful  in  identifying  where  errors  occur 
even  when  detected  in  the  statistics.  Agreement  to  5  significant  figures  was  generally  obtainable 
atthough  problems  were  encountered  deciding  when  to  employ  relative  error  tests  versus  absolute 
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do  k-1.1 

«h«r«  (updmaak) 
amCk. : . «  0.0 
bm(k.:.:)  «  1.0 
ca(k.:,:)  «  0.0 
and  share 
and  do 

share  (updnaak  .and.  kt  .gt.  2) 
cm(l.:,:)  >  -dtzmb2(l)*zkmh(2, : , 
had,:,:)  ■  l.-cn(l,:,:) 
end  shore 
do  k>2,l>2 

shore  (k  .le.  kt-2  .and.  updBaak  .and.  kt  .gt.  2) 


aa(k,:,:)  ■  >dtzad>l(k)*zkBh(k, : , :) 
c«(k,:,:)  ■  •*dtzab2(k)*zkadi(k4^1, ; , :) 
ha(k,:,:)  ■  l.-aa(k,:,:)-aa(k.:,:) 

end  shore 
end  do 
do  k«2,l-l 

shore  (k  .e^.  kt-l  .and.  i^Klamak  .and.  kt  .gt.  2} 
aa(k,:,:}  ■  -dtzBbl(k}*zkBh(k, : , :} 
haCk.:,:)  ■  l.-aaCk,:,:) 

OBd  shore 
end  do 

call  tridCii.Kfl.aa,ha,ca,u2,updBa8k  .and.  kt  .gt.  2) 


R«.  10  —  Itidiacoaal  Setsp — ^Aftcr 
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error  tests.  In  one  portion  of  the  code,  the  routine  UPDAT2  and  the  called  routine  MQD2,  it  was 
impossible  to  duplicate  the  Sun  Fortran  77  results  on  the  CM-5.  In  further  testing,  it  was  found 
that  Convex  single  precision  results  also  differed  from  the  Sun  and  the  CM-5.  Eventually,  double 
precision  Convex  results  were  verified  with  double  precision  CM-5. 

6  PERFORMANCE 

The  conversion  of  TOPS  was  carried  out  only  to  the  point  where  correct  answers  were  obtained 
through  a  number  of  time  steps.  As  the  result  of  changing  priorities,  further  work  on  the  parallel 
version  of  TOPS  has  been  delayed  and  comparative  timings  in  particular  remain  undone. 

The  projected  performance  of  TOPS  on  the  Connection  Machine  is  expected  to  be  good.  No 
unusual  problems  were  encountered  that  required  general  communication.  No  attempt  has  been 
made,  however,  to  collect  together  repeated  communications  operations  as  was  done  with  the 
OCEANS  code. 

One  final  performsmce  comment  is  appropriate.  The  relative  performance  of  TOPS  will  increase 
with  larger  grid  sizes  (higher  resolutions).  The  largest  resolution  which  seems  appropriate  for  TOPS 
is  about  768  by  384  by  51  points.  Most  parallelism  is  in  768  by  384  layers.  This  grid  size  is  only 
moderate  for  the  Connection  Machine,  and  is  significantly  smaller  than  the  approximately  2000 
by  1000  grid  size  used  in  OCEANS.  Performance  is  thus  expected  to  be  somewhat  less  than  that 
achieved  by  OCEANS  running  on  the  CM-5.  For  details  on  the  performance  of  OCEANS,  see  [7]. 

7  FUTURE  WORK 

The  TOPS  conversion  was  overtaken  by  chan^ng  priorities  and  was  not  completely  finished 
although  the  model  appears  to  be  producing  correct  results.  A  number  of  generally  small  changes 
could  be,  and  perhaps  should  be,  undertaken  to  move  the  modd  to  a  more  usable  state. 

Fint,  the  Thinking  Machines  compiMr  restrictions  on  serial  axes  have  been  removed  and  hence 
the  axes  could  be  reordered  into  a  more  standard  Fortran  ordering.  This  change  is  relatively 
strv^tforward  to  carry  out  but  involves  changing  almost  every  line  of  code. 

Onoe  the  axes  have  been  reordered,  a  new  run  time  feature  called  array  aliasing  can  be  em¬ 
ployed.  This  is  most  useful  as  a  means  of  improving  the  efficiency  of  communication  operations, 
exdusivi^  CSHIFT  operatioiu  in  TOPS.  Tlus  may  yield  a  significant  improvement  but  the  long 
term  improvement  may  disappear  as  the  compiler  and  run  time  system  are  improved.  Additionally, 
this  feature  is  definitdy  non-portable. 

A  minor  change  is  removal  of  dead  code.  Most  of  the  existing  TOPS  Fortran  77  code  is  still 
present,  commented  out,  in  the  source  code.  This  is  useful  during  debugging  but  it  does  detract 
from  the  readability  of  the  finished  code.  This  change  is  pervasive  but  cosmetic  only.  Despite  the 
posdbffity  of  introdudng  errors  by  acddentaUy  deletixig  live  code,  the  final  product  is  worthwhile. 

The  tTidisgrtwml  solver  is  also  an  area  where  improvements  can  be  made.  First,  it  is  possible  to 
reduce  the  order  of  the  tiidiagonal  systems  solved.  Currently  the  size  is  equal  to  the  total  vertical 
diwMiHsinfB.  This  is  probably  highly  omservative  and  it  could  be  reduced  with  little  effort  to  the 


maximum  size  needed  over  all  active  columns.  This  is  likely  to  yield  a  useful  improvement  but  it 
typical  magnitude  is  unknown. 


Also  in  the  tridiagonal  solver,  a  new  compiler  capability  called  local/global  mode  could  be 
exploited.  Fortran  90  code  normadly  runs  in  what  is  called  global  mode  where  operations  are 
synchronized  at  a  statement  level.  In  local  mode  a  subroutine  executes  independently  on  its 
own  local  data,  essentially  in  MIMD  mode.  Applying  this  to  the  tridiagonal  solver,  each  CM-5 
processing  node  can  solve  all  of  its  local  tridiagonal  systems  using  whatever  method  desired  and 
the  total  running  time  is  simply  the  maximum  time  needed  by  any  processing  node.  Local  mode 
is  not  part  of  Fortran  90  or  High  Performance  Fortran  and  so  this  change  would  not  be  portable. 

Some  of  the  code  in  TOPS  was  not  converted  because  it  was  not  used  in  the  test  case.  Perhaps 
the  most  important  of  these  was  routine  CONAD.  This  routine  has  been  preliminarily  converted  but 
not  debugged.  Code  conversion  was  the  most  difficult  in  all  of  TOPS.  Discussion  with  code  author 
Paul  Martin  has  led  to  the  conclusion  that  a  revised  formulation  could  be  the  best  longer  term 
solution  since  performance  of  COHAD  as  currently  converted  would  not  appear  to  be  especially  good. 

The  final  code  change  is  a  relatively  minor  one.  TOPS  has  a  direct  access  file  in  which  history 
variables  and  key  intermediaries  are  stored  for  output  and  debugging  purposes.  The  time  to  write 
this  file  is  a  significant  part  of  the  current  parallel  run  time  and  improvements  in  this  area  would 
be  beneficial  to  the  overall  run  time.  The  improvement  can  come  from  using  the  CM-5  Scalable 
Disk  Array,  a  parallel  file  system,  in  one  of  several  ways.  On  the  other  hand,  the  frequency  with 
which  the  file  is  written  is  user  controlled  and  need  not  occur  often  in  practice. 

Finally,  TOPS  needs  to  be  timed  on  larger  test  cases  and  performance  problems  fixed.  This 
will  allow  direct  and  meaningful  performance  comparisons  between  the  parallel  version  and  the 
serial  version.  There  is  much  external  Navy  interest  in  these  comparisoiu. 


8  CONCLUSIONS 


The  code  conversion  of  the  model  was  relatii^y  straightforward  although  not  as  simple  as  in 
the  OCEANS  modd.  The  conversion  process  introduced  generally  systematic  changes  in  the  data 
structures  because  the  slab  teduuque  used  in  the  Fortran  77  version  wm  extended  to  fully  parallel 
for  the  Connection  Machine. 

The  performance  the  paralld  version  was  not  tested  although  no  evident  performance  prob¬ 
lems  were  encountered. 

Verification  of  the  code  was  problematical.  It  was  finally  discovered  that  a  portion  of  the  code 
was  sensitive  to  round  off  error  in  sinfd^  prednon  and  hence  verification  was  not  possible  until 
both  serial  and  paiaUd  codes  were  run  using  64  bit  arithmetic. 

Chan^ttg  iniorities  have  left  the  TOPS  conversion  at  a  preliminary  operational  levd.  There 
are  a  number  of  rdativdy  small  activities  which  need  to  be  done  for  TOPS,  particularly  timings 
on  hi|^  resolution  grids. 


18 


9  ACKNOWLEDGMENTS 


The  authors  wish  to  thank  Dr.  Paul  J.  Martin  for  his  assistance  with  the  conversion  of  TOPS. 
We  also  thank  Dr.  Henry  Dardy  of  the  NRL  Connection  Machine  Facility,  Dr.  Steve  Piacsek  of 
NRL-Stennis,  and  Dr.  Robert  A.  Peloquin  and  Elizabeth  Wald  of  the  Office  of  Naval  Research  for 
their  support. 


10  REFERENCES 

1.  “Connection  Machine  Model  CM-200  Technical  Summary,”  Thinking  Machines  Corporation, 
June  1991. 

2.  “Connection  Machine  Model  CM-5  Technical  Summary,”  Thinking  Machines  Corporation, 
Jan.  1992. 

3.  “CMSSL  for  CM  Fortran,”  Thinking  Machines  Corporation,  Jan.  1992. 

4.  “American  National  Standards  for  Information  Systems  Programming  Language  Fortran,” 
Americsm  National  Standards  Institute  (in  press). 

5.  “High  Performance  Fortran,”  Available  from  C.  H.  Koelbel,  Rice  University  Computer  Science 
Department,  Houston,  TX. 

6.  Sela,  J.  G.,  Anderson,  P.  B.,  Norton,  D.  W.,  Young,  M.  A.,  Massive  Parallelization  of  NMC’s 
Spectral  Model,  Journal  of  Parallel  and  Distributed  Computing,  to  appear. 

7.  Anderson,  P.  B.,  Young,  M.  A.,  Connection  Machine  Software  Conversion  of  a  Navy  Oceans 
Model,  NRL  memorandum  report  7400,  1993. 


19 


